I. INTRODUCTION

Everyday sights and sounds are typically described with reference to the 
environmental object that produced them and not to the physical pattern of 
stimulation at the sensory receptor. Thus, we say that we see a house rather 
than an array of points and edges and that we hear a bell rather than a 
complex of inharmonic partials. This object-oriented view of perception 
has come to be known as object perception. In the case of vision the physical 
features of environmental objects map directly to patterns of stimulation on 
the retina. Quite naturally, then, the study of visual object perception con- 
centrates on revealing the details of further processing of the peripheral 
representation, on such issues as size and shape invariance under various 
transformations of the retinal image. In contrast, hearing offers no direct 
peripheral representation of environmental objects. All auditory sensory 
information is packaged in a pair of acoustical pressure waveforms, one at 
each ear. While there is obvious structure in these waveforms, that structure 
(temporal and spectral patterns) bears no simple relationship to the structure 
of the environmental objects that produced them. The properties of audi- 
tory objects and their layout in space must be derived completely from 
higher level processing of the peripheral input. Thus, many of the issues
central to the study of auditory object perception are different from those
involved in visual object perception. 

The definition of what constitutes an auditory object is an issue of some 
controversy and considerable importance. Many acoustical waveforms 
evoke a mental reference to the source of the waveform. These are clearly
auditory objects. We hear a church bell, for example, or ice tinkling in a 
glass. We hear the objects themselves and are generally unaware of the 
spectral and temporal structure of those waveforms. However, reference to 
an identifiable physical object may not be a necessary condition for auditory 
“objectness.” As we mention later, waveforms made up of sequences of
pure tones can also contain what most would agree are primitive auditory
objects, even though no known physical object could have produced the 
sounds. 

That the study of auditory object perception is immature is reflected in
the fact that there are few empirical data on the important issues. Thus, 
while we can be precise here in our descriptions of the physical features of 
auditory stimuli and somewhat certain about the details of the peripheral 
encoding of those features, discussion of the higher level processing that 
subserves auditory object formation and segregation must be speculative. In 
the context of our discussion of the spatial layout of auditory objects, for 
example, we can and do review the substantial body of evidence on the 
factors that determine the apparent spatial positions of single, static sound 
sources. However, since there are relatively few data on the perception of 
moving sources and virtually no data on perception of the spatial relations 
among auditory objects, our treatment of these important issues is limited 
to an analysis of the potential sources of information and does not attempt 
to address in detail the questions related to how those sources of information 
may be utilized. 

The chapter begins with a discussion of the peculiarities of acoustical 
stimuli and how they are received by the human auditory system. A distinc- 
tion is made, following Gibson (1966), between the ambient sound field and 
the effective stimulus to differentiate the perceptual distinctions among vari- 
ous simple classes of sound sources (ambient field) from the known percep- 
tual consequences of the linear transformations of the sound wave from 
source to receiver (effective stimulus). Next we deal briefly with the defini- 
tion of an auditory object, specifically the question of how the various 
components of a sound stream become segregated into distinct auditory 
objects. The remainder of the chapter focuses on issues related to the spatial 
layout of auditory objects. Stationary objects are considered first. Since 
much of the material relevant to this subject has recently been reviewed 
elsewhere (e.g., Middlebrooks & Green, 1991; Wightman & Kistler, 1993),
the section concentrates on topics not covered in those previous reports. 
The sources of information related to the apparent distance of an auditory
object are one such topic. The spatial layout of moving auditory objects is
discussed next, and in this context we offer a detailed treatment of the 
acoustics of moving sound sources. A distinction between source move- 
ment and observer movement is made to draw attention to the possible role 
of proprioceptive feedback in the perception of auditory spatial layout. The 
chapter concludes with a brief treatment of experimental evidence on the 
importance of input from other senses (vision, primarily) in establishing 
auditory spatial layout. 

II. ACOUSTICAL INFORMATION: THE AMBIENT SOUND 
FIELD AND THE EFFECTIVE STIMULUS 

As we use the term here, information is an abstract construct that serves as the 
bridge between an organism and its environment. It has a structure that is 
not related to the characteristics of either the transmitting medium or the 
receptor surface. For example, the "squareness" of a visual object is spe- 
cified by information (e.g., relationships among visual patterns) that is not 
defined in terms of the physics of light or the anatomy and physiology of 
the retina. In the case of auditory objects, the mechanical events that pro-
duce them have lawful acoustical consequences in the sound patterns that are
represented to the peripheral auditory system. If those patterns map in a
one-to-one or many-to-one fashion onto the object properties, then they 
constitute information that potentially specifies those properties. In princi- 
ple, then, for any physical property of an environmental object to be recov- 
erable by an organism there must be information available to the perceiver 
that specifies that property. 

The specific property of auditory objects that is of interest here is spatial 
layout. The information about auditory spatial layout is acoustically con- 
veyed, and thus the stimulus that must be decoded by the perceiver to 
determine spatial layout is a sound wave. There is information about spatial 
layout contributed both by the specific type of sound wave that is generated 
and by the transformations that sound waves undergo in their passage from 
the source to our ears. This section of the chapter first provides an overview
of the broad classes of simple sound sources and the characteristics of the
waves they produce (the ambient field) and then discusses in detail the
source-to-receiver transformations that convey information about the
spatial layout of the sound sources (the effective stimulus).


A. The Ambient Sound Field 

Waves in general are important means by which information about a physi-
cal event is conveyed to a perceiver. Discussion of wave generation and
propagation is beyond the scope of this chapter since both are extraordi-
narily complex topics, especially in the case of naturally occurring physical
events and natural environments. Simplifying assumptions are not only 
useful but mandatory for our purposes here. In the case of sound-producing 
events, a convenient assumption is that the sound is produced by a so-called 
point source, or acoustic monopole, and that the propagation equations are 
linear. Any small object vibrating in a mass of fluid (air) has all the attributes
of an acoustic monopole, provided the dimensions of the object are small 
relative to the sound wavelengths produced and the sound field of interest is 
several object lengths away. The sound field produced by a monopole is
omnidirectional, that is, the same in any direction equidistant from the 
source. 

The sound fields produced by two or more simultaneously active mo-
nopoles can be assumed to combine linearly. Thus, an acoustic dipole, a very
common type of sound source in nature, can be described as the superposi-
tion of two spatially separated monopole sources that are 180° out of phase.
In contrast with monopole sources, which are omnidirectional, dipole
sources have both magnitude and orientation. The structure of the dipole 
field can best be understood by considering the dipole in terms of its cancel- 
ing monopoles. The field has an angular dependence with no sound at all 
produced at 90° to the dipole axis where the sound fields of the constituent 
monopoles exactly cancel. 

The intensity of a sound wave (proportional to pressure squared per unit 
area) diminishes as the wave travels away from the source. Several factors 
are responsible for this. One that applies to all sound waves, including those 
produced by monopoles and dipoles, is atmospheric absorption. Absorp-
tion is the result of nonadiabatic propagation caused by temperature differ-
entials between compressions and rarefactions in the propagating wave and
in air depends on temperature, humidity, and wavelength. The attenuation 
coefficient in air at 20°C with 50% humidity is approximately 1 x 10*" w pi m, 
where f is frequency in Hz. For a monopole source, intensity also decreases
with the inverse square of the distance from the source because the total
acoustical power is spread out over the surface area of a sphere, the radius of
which is the distance from the source. When considering both geometrical
spreading and absorption, the intensity (I) of a monopolar source as a
function of distance can be written


$$ I(r) = \frac{P}{4\pi r^{2}}\, e^{-\alpha r}, $$


where r is the distance from the sound source, P is the total power produced
by the source, and α is the attenuation coefficient. Sometimes the term
attenuation length, 1/α, is used to describe the distance over which the inten-
sity decreases to 1/e. At short distances the decrease in intensity with dis-
tance is dominated by spherical spreading, whereas at distances well beyond
the attenuation length, absorption is dominant. 
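
The interplay of the two loss terms can be explored numerically. The following is a minimal sketch, assuming the intensity takes the form I(r) = P / (4πr²) · exp(-αr) described above; the value of `ALPHA` is a hypothetical placeholder chosen only for illustration (it is not the coefficient quoted in the text), and the function names are ours.

```python
import math

def monopole_intensity(r, power, alpha):
    """Intensity at distance r (m) from a monopole radiating `power` watts.

    Combines spherical spreading, P / (4*pi*r^2), with exponential
    absorption, exp(-alpha*r). `alpha` is in 1/m.
    """
    if r <= 0:
        raise ValueError("distance must be positive")
    return power / (4.0 * math.pi * r ** 2) * math.exp(-alpha * r)

def level_drop_db(r_near, r_far, alpha):
    """Drop in intensity level (dB) between two distances from the source."""
    ratio = monopole_intensity(r_near, 1.0, alpha) / monopole_intensity(r_far, 1.0, alpha)
    return 10.0 * math.log10(ratio)

ALPHA = 0.01  # hypothetical attenuation coefficient, 1/m (attenuation length 100 m)

for r in (1.0, 10.0, 100.0, 1000.0):
    spreading = 10.0 * math.log10(4.0)                   # dB lost to spreading per doubling
    absorption = 10.0 * math.log10(math.e) * ALPHA * r   # dB lost to absorption from r to 2r
    print(f"r = {r:6.0f} m: doubling costs {level_drop_db(r, 2 * r, ALPHA):6.2f} dB "
          f"(spreading {spreading:.2f}, absorption {absorption:.2f})")
```

Under these assumptions the spreading loss per doubling of distance is constant (about 6 dB), whereas the absorption loss over the same doubling grows in proportion to r, which is why absorption eventually dominates far from the source.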

The intensity of the sound field produced by a dipole decreases some- 
what differently with distance. For a dipole field it is simplest to discuss the 
decrease in pressure (proportional to the square root of intensity). The 
equation governing the pressure decrease is complicated, but its essential 
elements are a magnitude and a direction component. The magnitude part
has two terms: one decreasing with the inverse square of distance, and the
other with the inverse of distance. The inverse square dependence dominates
the field near the source, and the inverse-distance component dominates at
large distances.

The characteristics of sound radiation, whether modeled as a monopole 
or as a dipole, may contribute significant information to aid source identi- 
fication and to determine spatial layout. As described above, monopoles 
radiate sound evenly in all directions, but dipoles have a figure-eight direc- 
tivity pattern. While the compression and rarefaction components cancel in 
a plane perpendicular to the dipole axis, a pressure gradient does exist in the
field near the source that may be useful for tracking a sound source. An 
example of a dipole source that we are particularly interested in tracking is a 
flying insect near our ear. There are also more complex sources in nature
that can be modeled as the sum of several constituent dipoles. 
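
The figure-eight pattern follows directly from superposition. The sketch below builds a dipole from two idealized point sources of opposite sign and evaluates the summed pressure magnitude on a circle around it. The frequency, separation, and observation distance are hypothetical values chosen for illustration, and the free-field point-source expression exp(-jkr)/r is the standard textbook form, assumed here rather than taken from the chapter.

```python
import cmath
import math

def monopole_pressure(src, obs, k, amplitude=1.0):
    """Complex pressure of an ideal point source: amplitude * exp(-j*k*r) / r."""
    r = math.dist(src, obs)
    return amplitude * cmath.exp(-1j * k * r) / r

def dipole_pressure(obs, k, separation):
    """Two monopoles on the x axis with opposite sign (180 degrees out of phase)."""
    plus = (+separation / 2.0, 0.0)
    minus = (-separation / 2.0, 0.0)
    return monopole_pressure(plus, obs, k, +1.0) + monopole_pressure(minus, obs, k, -1.0)

FREQ = 500.0                    # Hz, illustrative
C = 343.0                       # m/s, nominal speed of sound in air
K = 2.0 * math.pi * FREQ / C    # wavenumber, rad/m
SEP = 0.02                      # m, hypothetical separation (small relative to wavelength)
R = 2.0                         # m, observation distance

for deg in (0, 30, 60, 90, 120, 150, 180):
    a = math.radians(deg)
    p = dipole_pressure((R * math.cos(a), R * math.sin(a)), K, SEP)
    print(f"{deg:3d} deg from dipole axis: |p| = {abs(p):.5f}")
```

The magnitude is largest along the dipole axis and falls essentially to zero at 90°, where the two constituent monopoles are equidistant from the observer and cancel.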


B. The Effective Stimulus 

For our purposes here the effective stimulus is defined in terms of the 
acoustical pressure waveforms produced by an ambient sound field as they 
exist just before transduction at the listener’s eardrums. For simplicity we 
assume that the ambient field is produced by one or more acoustical mo- 
nopoles. The relationship between the ambient field and the effective stimu- 
lus is defined by a series of linear transformations of the acoustical wave- 
form that incorporate a number of potential sources of information about 
the spatial layout of sound sources in the environment. In this section of the 
chapter we identify the relevant transformations and describe the spatial 
information that each incorporates. In a later section we examine in detail 
the evidence on whether the information is perceptually relevant. 

The acoustics of the local environment that includes the source and the 
listener contribute several potentially important sources of information 
about spatial layout. For example, because of the long wavelengths and slow 
propagation velocity of sound, the reflections and diffractions of an emitted 
sound wave off the walls, floor, ceiling, and contents of a typical room 
enrich the ambient sound field considerably. There is information about the 
size of the room in the timing of the reflections, information about the wall 
coverings and contents in the pattern of reverberation, and information
about the distance between source and listener in the ratio of direct to
reflected sound. If long distances are involved, such as in large rooms or in
open spaces, the high-frequency content of the effective stimulus is reduced
by atmospheric absorption. There is ample evidence that all of these effects
are detectable by a normal-hearing listener.

The listener's shoulders, head, and outer ear structures (especially the
pinnae) are significant components of the local acoustical environment and
as such contribute additional information relevant to auditory spatial layout.
The pattern of reflections and diffractions of an incident sound wave off
these structures is heavily dependent on the direction from which the sound
arrives, and thus, the information contributed by these effects relates pri-
marily to the direction of auditory objects. The pinnae, in particular, are
highly directional, modifying incident sound waves in ways that are specific
to each different angle of incidence. As in the case of room effects, there is
ample evidence of the detectability of pinna effects.

The fact that we have two ears separated by an acoustically opaque head
suggests that information about auditory spatial layout may come from
three sources: the effective stimulus at the left ear, the effective stimulus at
the right ear, and the difference. These are clearly not independent sources
of information. However, there are reasons to believe that all are important.
Information from the difference signal, for example, is uniquely indepen-
dent of the characteristics of the source, and because of the insensitivity of
the auditory system to the absolute timing of events, this is the only source
of information on the direction-dependent difference in the time-of-arrival
of an acoustic waveform. Because of the approximate lateral symmetry of
the head, interaural difference information is ambiguous. Interaural time
difference, for example, is the same for sources in the front and sources in
comparable positions (on the same side of the head, and at the same angles
relative to the interaural axis) in the rear. Information from each of the 
individual ears can potentially resolve these ambiguities. 

The information relevant to auditory spatial layout that is contained in
the effective stimuli at the two ears can be described as either temporal or
spectral patterns. At a formal mathematical level the two descriptions are
isomorphic, so one might think the choice is arbitrary. However, when
higher level processing of the information is considered, the distinction
becomes important because temporal and spectral processing mechanisms
in the auditory system are thought to be so different. For this reason, we
discuss temporal and spectral information separately. Because of the auditory
system's relative insensitivity to monaural phase (the phase spectrum of a
stimulus at one ear), our discussion of temporal information concentrates on
interaural time differences and the temporal patterns of room reflections.
Interaural phase, defined as the difference between the phase spectra of the
left and right ear stimuli, is relevant only when considering single-frequency
components of a stimulus. Our discussion of the spectral information in effec-
tive auditory stimuli focuses on the direction-dependent changes in the
magnitude components of the complex source-to-eardrum transformation.

III. AUDITORY OBJECTS 

It seems obvious that before any discussion of the rules that govern the 
spatial layout of auditory objects, we should know what an auditory object 
is. Unfortunately, there is little consensus either on what might constitute a
satisfactory definition of an auditory object or on what alternative terms might
be better. One alternative that has been proposed is sound event (Blauert,
1983), but this term seems to refer more directly to a disturbance of the
ambient sound field than to any aspect of the perception of that disturbance. 
Another alternative is sound stream (Bregman, 1990), but this term does not 
convey the obviously close association between everyday auditory stimuli 
and the environmental objects that produced them. The term auditory object
is borrowed from the field of visual perception in which the features of 
environmental objects map directly to features of the effective stimulus, a 
pattern of light on the retina. Its use in auditory perception is less satisfying 
since there is no straightforward mapping of object features to stimulus 
features. Nevertheless, the fact that auditory percepts in daily life are so
naturally and immediately associated with the objects that produced the 
sounds is undeniable and gives currency, if not clarity, to the term auditory 
object. 

The effective stimulus at each ear consists of a one-dimensional acoustical 
pressure waveform. This waveform contains the superposition of the acous- 
tic outputs from all of the objects in the listener’s environment. A complete 
understanding of what constitutes an auditory object would therefore in-
clude specification of the rules whereby the various components of the
single pressure waveform are segregated into discrete auditory objects.
These rules are the object of considerable current interest in the auditory 
research community (e.g., Bregman, 1990; Handel, 1989), and it is not our 
purpose to summarize them here. Rather, we focus on the contributions to 
this segregation process offered by spatial separation. For the purposes of 
our discussion, it may be helpful to distinguish between two kinds of audi- 
tory objects: concrete and abstract. Concrete auditory objects are formed by 
sounds emitted by real objects in the environment. Although experimental 
data are scarce, segregation of concrete objects seems to be primarily deter- 
mined by spatial and temporal rules. Abstract auditory objects do not often 
correspond to real environmental objects. They consist typically of more 
primitive sound elements and are formed by simpler frequency and tempo-
ral relations. There has been considerable research on the rules governing
the formation of abstract auditory objects (e.g., Bregman, 1990). We con- 
centrate here exclusively on concrete auditory objects. 


IV. SPATIAL LAYOUT OF STATIONARY AUDITORY OBJECTS 

Much of the experimental literature on auditory spatial layout concerns the 
accuracy with which the spatial position of a sound-producing object is
indicated to a listener, that is, the degree of correspondence between the 
actual position of the object and its apparent position. It is our view that 
experiments that focus on accuracy can fail to consider other important 
features of the auditory percept. For example, consider experiments on 
monaural listening. The results generally show that the apparent positions 
of auditory objects are strongly biased toward the interaural axis and the 
side of the functioning ear. However, those same results are often reported 
as indicating that monaural localization accuracy is near normal on the side 
of the functioning ear and progressively poorer off the interaural axis on 
that side. The emphasis on accuracy obscures the fact that in monaural 
listening all of the sounds appear to emanate from one place. For reasons 
such as this, we prefer to ignore the accuracy component of spatial layout 
altogether, and we discuss only the factors that govern the apparent spatial 
positions of auditory objects. 

The apparent spatial position of an auditory object is defined by its 
apparent direction and its apparent distance relative to the listener. The 
potential sources of information for apparent direction and the stimulus 
features that appear to govern apparent direction have recently been
discussed extensively elsewhere (Middlebrooks & Green, 1991; Wightman
& Kistler, 1993). Therefore, the material on apparent direction is only sum-
marized here. Much less attention has been paid to apparent distance, and
although data are scarce, they are covered in some detail in this chapter. 


A. Acoustical Sources of Information about Static Spatial Layout

The spatial position of each sound-producing object in a listener’s environ- 
ment is specified by several acoustical sources of information that for brevi- 
ty we call cues. Many of the cues are a result of the interactions of the sound 
waves with the listener’s head and pinnae. These interactions are conve- 
niently summarized by a linear transformation, the so-called head-related 
transfer function (HRTF), which represents the changes in the amplitude and
phase of the sound wave from the sounding object’s position to the listener’s
eardrum. Mathematically, HRTFs are usually specified in terms of the
sound wave’s spectrum. Thus, if X(jω) is the source spectrum (j is the
complex operator and ω is angular frequency) and Y(jω) is the spectrum of
the waveform at the eardrum, then the HRTF, H(jω), is given by


$$ H(j\omega) = \frac{Y(j\omega)}{X(j\omega)}. \qquad (1) $$


More generally, since the HRTF varies with source direction and distance
and thus is different at each ear, we must write two equations for H(jω): one
for the left ear and one for the right ear. Each depends on source azimuth
(θ), elevation (φ), and distance (d) relative to the listener:


$$ H_l(\theta, \phi, d, j\omega) = \frac{Y_l(\theta, \phi, d, j\omega)}{X(j\omega)} \qquad (2) $$

and

$$ H_r(\theta, \phi, d, j\omega) = \frac{Y_r(\theta, \phi, d, j\omega)}{X(j\omega)}. \qquad (3) $$
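
The definition of a transfer function as the ratio of the eardrum spectrum to the source spectrum can be illustrated with a toy calculation. The sketch below is not a model of a real head or ear: the "ear path" is assumed to do nothing more than attenuate and delay the source, the sample values are arbitrary, and the helper names (`dft`, `transfer_function`) are ours. It shows only that dividing the two spectra recovers the gain and phase shift imposed by the path, independent of the particular source waveform.

```python
import cmath
import math

def dft(signal):
    """Plain discrete Fourier transform (O(N^2)); adequate for a short example."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def transfer_function(source, at_eardrum):
    """H = Y / X per frequency bin; bins where X is essentially zero give None."""
    x_spec, y_spec = dft(source), dft(at_eardrum)
    return [y / x if abs(x) > 1e-12 else None for x, y in zip(x_spec, y_spec)]

# Hypothetical source waveform (arbitrary samples, not measured data).
source = [0.0, 1.0, 0.5, -0.3, 0.8, -0.6, 0.2, 0.1]
GAIN, DELAY = 0.5, 2   # toy path: halve the amplitude, delay by 2 samples
n = len(source)
eardrum = [GAIN * source[(t - DELAY) % n] for t in range(n)]  # circular delay

for k, h in enumerate(transfer_function(source, eardrum)[: n // 2 + 1]):
    if h is not None:
        print(f"bin {k}: |H| = {abs(h):.3f}, "
              f"phase = {math.degrees(cmath.phase(h)):7.1f} deg")
```

In this toy case the magnitude of H is the same in every bin (the path gain) and the phase falls linearly with frequency (the path delay). Measured HRTFs differ precisely in that both quantities vary with frequency and with source position.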


All of the information about sound source position is represented in the
pair of HRTFs shown above. These HRTFs vary in complicated ways with
changes in source position, so simplifying assumptions must be made to
appreciate the essential elements. Two convenient assumptions are that the
acoustical space enclosing the source and listener is anechoic and that the
listener’s head is spherical with pinna-less ears at opposite ends of a diameter
of the sphere. The anechoic assumption allows the main effect of distance to 
be modeled as a simple attenuation of 6 dB for every doubling of distance 
from the source. The spherical head assumption leads to a greatly simplified 
account of the effects of diffraction of the sound wave around the head. 
Figure 1 illustrates the latter point. When ignoring the details for a moment 
(the spherical model is described in detail in Kuhn, 1977), we see that at each 
ear variations in source azimuth (or elevation, not shown in the figure) can 
be expected to produce mainly variations in effective stimulus intensity, a 
result of the head shadow effect when the source is on the opposite side of the 
head from the ear under consideration. The head shadow effect can be 
expected to be much larger at high frequencies than at low frequencies. This 
is because at low frequencies sound wavelengths would be long with respect 
to the dimensions of the head, and thus the sound waves would travel 
around the head without attenuation. The covariation of stimulus intensity
with azimuth (and elevation) that occurs at each ear individually can be
viewed as a potential monaural cue to sound source position. Figure 1 also
illustrates the potential binaural cues to sound source position that are offered
by interaural differences (defined by the ratio of the two HRTFs). Note that
for all source azimuths other than 0° and 180°, the acoustical path from
source to ear has a different length for the two ears. This path-length differ-
ence produces a small difference in the time of arrival of the sound wave at
the two ears. The interaural time difference (ITD) varies systematically with
source azimuth and is largest for azimuths of +90° and -90°. In addition,
because of the head shadow effect mentioned earlier, there will be an inter-
aural level difference (ILD) that varies with azimuth in roughly the same
way as ITD and is large at high frequencies and small at low frequencies.

FIGURE 1 Schematic top-down representation of a listener and a sound source. The
source is assumed to be sufficiently far from the listener such that the acoustical wavefronts are
planar, and the listener is assumed to have a spherical head with ears at opposite ends of a
diameter.

The utility of monaural cues is compromised by the fact that some or all
features of the sound source waveform must be known for the cue to be
unambiguous. In the simple spherical head case described above, while
stimulus intensity at a given ear varies systematically with source azimuth, a
listener with access only to the effective stimulus at that ear would have no
way of knowing whether a weak stimulus was produced by a source on the
opposite side of the head or by a weak source. In more general terms, note
that (from Equation 3) the effective stimulus at one ear, say the right ear, is
defined by the product of the source spectrum and the HRTF:

$$ Y_r(\theta, \phi, d, j\omega) = X(j\omega)\, H_r(\theta, \phi, d, j\omega). \qquad (4) $$

Thus, even if a listener had perfect memory for the HRTF at each and every 
possible source position, a given effective stimulus could unambiguously
indicate a specific source position only if the source spectrum were known. 
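
The ambiguity can be made concrete at a single frequency, where the eardrum magnitude is simply the product of the source magnitude and the HRTF magnitude. The sketch below is illustrative only: the two HRTF magnitudes are hypothetical numbers chosen so that the head shadow is easy to see, and the function names are ours.

```python
# Hypothetical right-ear HRTF magnitudes at one frequency (illustrative, not measured).
HRTF_RIGHT = {"near side (+90 deg)": 2.0, "far side (-90 deg)": 0.25}

def eardrum_magnitude(source_magnitude, position):
    """|Y_r| = |X| * |H_r| at a single frequency (source magnitude times HRTF magnitude)."""
    return source_magnitude * HRTF_RIGHT[position]

def candidate_sources(observed):
    """Source magnitude each position would need in order to explain the observation."""
    return {position: observed / gain for position, gain in HRTF_RIGHT.items()}

def positions_given_source(observed, source_magnitude, tolerance=1e-9):
    """Positions consistent with the observation once the source magnitude is known."""
    return [position for position in HRTF_RIGHT
            if abs(eardrum_magnitude(source_magnitude, position) - observed) <= tolerance]

loud_far = eardrum_magnitude(8.0, "far side (-90 deg)")
quiet_near = eardrum_magnitude(1.0, "near side (+90 deg)")
print("same stimulus at the eardrum:", loud_far == quiet_near)
print("explanations without source knowledge:", candidate_sources(loud_far))
print("positions if |X| is known to be 1.0:", positions_given_source(loud_far, 1.0))
```

A strong source in the head shadow and a weak source on the near side yield the same eardrum magnitude here. Only when the source magnitude is supplied does the observation single out one position.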

Binaural cues to source position are derived from the ratio of the trans- 
duced representations of the two effective stimuli. Thus, the utility of these 
cues does not require knowledge of the source spectrum since that term
appears in both numerator and denominator and hence cancels. Neverthe-
less, to the extent that the spherical head model is accurate, binaural cues are
also ambiguous. Note, as shown in Figure 1, that the difference in acoustical
path length from the source to the two ears, which gives rise to the ITD, is
the same for sources in front and in the rear. A source at an azimuth of 30°,
for example, would produce the same ITD as a source at 150° azimuth. The
same could be said for ILDs and for sources at complementary positions
above and below the horizontal plane. In fact, the spherical head model
predicts conical surfaces projecting outward from the ears along which ITD
and ILD are constant and thus along which cues that are based on ITD and
ILD would be ambiguous. These are the so-called cones of confusion. We
should mention here that cone-of-confusion ambiguities could be resolved
by head movements, as Wallach (1940) pointed out in his now-classic trea-
tise on the issue. If a listener knew both the direction of movement of the
head and the direction of change of the ITD or ILD cue, the direction of the
sound source could be derived without ambiguity.
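
The head-movement argument can be checked with a deliberately crude model. The sketch below assumes a distant source and ignores diffraction around the head, so that the ITD is just the path-length difference between two points, (d/c)·sin(azimuth). This is a simplifying assumption of ours, not the spherical head model, and the interaural distance is a hypothetical value. Azimuth is measured clockwise from straight ahead.

```python
import math

C = 34300.0  # nominal speed of sound, cm/s
D = 18.0     # hypothetical interaural distance, cm

def itd_seconds(azimuth_deg):
    """Path-length ITD for a distant source, ignoring diffraction: (D / C) * sin(azimuth).

    Positive values mean the sound reaches the right ear first.
    """
    return D / C * math.sin(math.radians(azimuth_deg))

def itd_after_head_turn(azimuth_deg, turn_deg):
    """ITD after the listener turns the head `turn_deg` to the right."""
    return itd_seconds(azimuth_deg - turn_deg)

front, rear = 30.0, 150.0
for name, az in (("front", front), ("rear", rear)):
    before = itd_seconds(az)
    after = itd_after_head_turn(az, 10.0)
    trend = "shrinks" if abs(after) < abs(before) else "grows"
    print(f"{name} source at {az:.0f} deg: "
          f"ITD {before * 1e6:.0f} us -> {after * 1e6:.0f} us ({trend})")
```

With the head still, the two sources are indistinguishable by ITD. After a small turn toward the right, the ITD shrinks for the frontal source and grows for the rear source, so the sign of the change, combined with knowledge of the turn direction, separates front from rear.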

Detailed measurements of human HRTFs (Middlebrooks & Green, 1990;
Middlebrooks, Makous, & Green, 1989; Pralong & Carlile, 1994; Shaw,
1974; Wightman & Kistler, 1989a) provide a complete catalog of the poten-
tial acoustical cues to apparent sound position and highlight the limitations
of the spherical head model. The most prominent features of HRTFs not
anticipated by the spherical head model are the directional filtering charac-
teristics of the pinnae and the large listener-to-listener differences in
HRTFs. The multiple ridges and cavities of the pinna produce resonant 
peaks and antiresonant notches in the magnitude response of the HRTF. 
The frequencies at which these peaks and notches appear are dependent on 
sound source direction and thus could serve as potential spatial position 
cues, provided some a priori information about the source was available.
Figure 2 shows an example of how the frequency of a given notch in the 
HRTF changes with sound source elevation. HRTFs from two listeners are
shown in this figure to illustrate individual differences. Note that while the 
general characteristics of the notches are the same from listener to listener, 
the frequencies at which the notches appear are highly listener dependent. 

The spherical head model provides a reasonably accurate prediction of 
the ITDs derived from actual HRTF measurements. Figure 3 shows ITDs 
from the horizontal plane HRTFs of a representative listener estimated by 
Wightman and Kistler (1989a). Also plotted in the figure are the ITDs 
predicted by 

$$ \mathrm{ITD} = \frac{d}{2c}\,(\theta + \sin\theta), \qquad (5) $$

where θ is the azimuth angle as in Figure 1, c is the velocity of the sound
wave (cm/s), and d is the interaural distance (cm), chosen for this example
to fit the HRTF data shown. While this equation is usually cited as repre-
senting the predictions of the spherical head model (e.g., Green, 1976;
Woodworth, 1938), it is really just a first-order approximation (Kuhn,
1977). Nevertheless, as Figure 3 shows, it provides an accurate representa-
tion of horizontal plane ITDs. Figure 4 (from Wightman & Kistler, 1993)
shows a more complete set of ITD data from the same listener. This figure
also shows the contours of constant ITD, which for the spherical head
model would be circular. Clearly, the spherical head model provides a good
first-order approximation to measured ITDs. Just as clearly, ITD is an
ambiguous cue to sound source direction since any given ITD signals not
one but a whole locus of potential directions.

FIGURE 2 Directional transfer functions from two listeners produced by a source at 90°
azimuth. Directional transfer functions (DTFs) are head-related transfer functions (HRTFs)
divided by the root-mean-square average of the HRTFs from all spatial positions measured.
Thus, the DTFs represent the deviation in dB from the average response of the ear. (Adapted
with permission from Wightman and Kistler, 1993.)

FIGURE 3 Interaural time differences (ITDs), produced by a source at 0° elevation, pre-
dicted by the spherical head model (solid line) and measured from a typical listener by using a
wideband correlation technique. (Reproduced with permission from Wightman and Kistler,
1993.)
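
The spherical head approximation for ITD, (d/2c)(θ + sin θ) with θ in radians, is easy to tabulate. The sketch below is a minimal illustration: the interaural distance `D_CM` is a hypothetical value (the chapter fits d to a particular listener's data and does not state it here), and the restriction to the frontal hemifield is our assumption about the formula's intended range.

```python
import math

C_CM_PER_S = 34300.0   # nominal speed of sound in air, cm/s
D_CM = 18.0            # hypothetical interaural distance, cm

def itd_spherical(azimuth_deg, d=D_CM, c=C_CM_PER_S):
    """ITD = (d / 2c) * (theta + sin(theta)), with theta in radians.

    Intended for azimuths between -90 and +90 degrees (frontal hemifield).
    """
    if not -90.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth must lie between -90 and +90 degrees")
    theta = math.radians(azimuth_deg)
    return d / (2.0 * c) * (theta + math.sin(theta))

for az in range(0, 91, 15):
    print(f"azimuth {az:2d} deg: ITD = {itd_spherical(az) * 1e6:6.1f} microseconds")
```

With these assumed values the ITD rises monotonically from zero at 0° to roughly 0.67 ms at 90°. Changing `D_CM` rescales the whole curve without altering its shape.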

Interaural level differences derived from HRTF measurements are com- 
plicated functions of frequency at each and every source direction, a situa- 
tion caused at least in part by pinna filtering effects. Figure 5 shows ILD 
functions derived from a single listener’s HRTF measurements at a source 
elevation of 0° and azimuths of 0° and 90°. Note that even for a source on the
median plane (0° azimuth), where ILDs would result only from interaural 
asymmetries, ILDs are large enough (greater than 0.5 dB, the ILD thres- 
hold) to be considered potential sources of information about source posi- 
tion. For a source at 90° ILDs are generally much larger, especially at high 
frequencies as would be expected because of head shadowing. 

The elaborate frequency dependence of ILDs complicates our discussion 
of them as potential cues to sound source position. We can discuss the 
interaural level cue either as an interaural spectral difference, referring to the 
entire pattern of ILDs across frequency, or as ILD averaged across one or 
more frequency bands. Figure 6 illustrates the latter approach. In the upper 
FIGURE 4 Interaural time differences (ITDs) from head-related transfer function (HRTF) 
measurements from a typical listener plotted as a function of the azimuth and elevation of the 
sound source. Note the contours of constant ITD below the surface plot. (Adapted with 
permission from Wightman and Kistler, 1993.) 



FIGURE 5 Interaural level difference (ILD) as a function of frequency from a typical 
listener, produced by a source at 0° elevation and 0° azimuth (dashed line) or 90° azimuth (solid 
line). 
FIGURE 6 Interaural level differences (ILDs) from a typical listener in different frequency 
regions. Figure 6a shows ILDs across the entire frequency spectrum, and Figures 6b and 6c 
show ILD in two high-frequency critical bands. (Adapted with permission from Wightman 
and Kistler, 1993.) 


panel we show one extreme, ILD averaged across the entire frequency 
spectrum. The bottom panels illustrate the other extreme, ILDs in two 
high-frequency critical bands. Note that the general pattern of ILD as a 
function of sound source direction is the same regardless of the bandwidth 
over which ILD is considered or the center frequency of the band. Note also 
that the general pattern of ILDs is the same as the pattern of ITDs, showing 
a similar kind of cone-of-confusion ambiguity. Thus, unless a listener could 
analyze the idiosyncratic details of ILD patterns in narrow bands, ILD infor- 
mation could not be used to disambiguate errors resulting from dependence 
on ITDs, and vice versa. As mentioned above, information provided by 
head movements can, in theory, offer such disambiguation. 

The acoustical sources of information about the distance of a sound- 
producing object are not well understood. Nor have they been well docu- 
mented by systematic measurements. In an anechoic environment, the two 
most obvious stimulus features that depend on distance are overall level and 
spectral content. Overall level decreases by 6 dB for every doubling of the 
distance between the source and the listener (the inverse square law), and 
atmospheric absorption gradually attenuates the high-frequency compo- 
nents of a sound as the distance between source and listener is increased 
(about 2 dB/100 ft at 6 kHz and 4 dB/100 ft at 10 kHz). The utility of both 
of these monaural cues, of course, depends on knowledge of source charac- 
teristics. However, the requirement for a priori knowledge about the source 
can be eliminated if the perceiver is allowed two or more “looks” at the 
stimulus from different vantage points. For example, Lambert (1974) 
pointed out that just two looks at stimulus intensity, as might be obtained if 
the perceiver’s head were rotated, would provide sufficient information for 
a determination of source distance, without the need for knowledge of 
source characteristics. 
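
The two anechoic cues above lend themselves to a short numerical illustration. The sketch below assumes pure inverse-square spreading and uses the approximate absorption rates quoted in the text (about 2 dB/100 ft at 6 kHz). The last function is a deliberately simplified reading of the "two looks" idea, assuming the second vantage point lies a known step farther away along the line to the source; it is not a reconstruction of Lambert's (1974) analysis, and all function names are hypothetical.

```python
import math

FT_PER_M = 3.28084  # feet per meter

def spreading_loss_db(d_near, d_far):
    """Level drop in dB between two distances under the inverse square law."""
    return 20.0 * math.log10(d_far / d_near)

def absorption_loss_db(distance_m, db_per_100ft):
    """High-frequency loss from atmospheric absorption over a path length."""
    return db_per_100ft * (distance_m * FT_PER_M) / 100.0

def distance_from_two_looks(level_near_db, level_far_db, step_m):
    """Source distance at the nearer look, given two levels and the step
    between the looks (assumed radial). No source level is needed."""
    ratio = 10.0 ** ((level_near_db - level_far_db) / 20.0)  # d_far / d_near
    return step_m / (ratio - 1.0)

if __name__ == "__main__":
    print(f"1 m -> 2 m: {spreading_loss_db(1.0, 2.0):.2f} dB spreading loss")
    print(f"100 ft at 6 kHz: {absorption_loss_db(30.48, 2.0):.2f} dB absorbed")
    drop = spreading_loss_db(2.0, 4.0)
    print(f"two looks 2 m apart: d = {distance_from_two_looks(0.0, -drop, 2.0):.2f} m")
```

Because only the level difference between the two looks enters the last function, the absolute source level cancels, which is the sense in which a priori knowledge of the source becomes unnecessary.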

There are two potential binaural distance cues: ITD and ILD; both vary 
slightly with the distance between source and listener (Coleman, 1963). In 
the case of ITD, for a source at 90° azimuth, there can be as much as a 150 µs 
difference in the ITD produced by a near source and a far source. A near 
source produces a larger ITD than a far source. This change in ITD with 
distance occurs because with a source close to the head the extra distance 
around the head is greater than if the source were far from the head. Dis- 
tance affects ILDs in a comparable way, although in this case the effect is 
highly frequency dependent. At low frequencies the distance effect is great- 
est. For a 300-Hz tone at 90° azimuth, for example, the ILD for a source far 
from the head (several wavelengths) is about 0.5 dB, but for a source at 44 
cm it is over 10 dB. The effects at higher frequencies and at source azimuths 
off the interaural axis are considerably smaller. 

In a nonanechoic environment, which of course includes nearly all every- 
day listening situations, there is an additional distance cue provided by the 
mix of the direct sound wave from source to listener with the reflections of 
that sound wave off the surfaces of the listening room. When the sound 
source is close to the head the direct sound dominates because, owing to the 
extra distance traveled and absorption at the surfaces, the level of the re- 
flected sound is always lower. However, as the source-to-listener distance 
increases, the direct sound level decreases, and the ratio of direct to reflected 
sound level decreases. Given a specific enclosure, then, this ratio is perfectly 
correlated with source-to-listener distance. Moreover, even though it is a 
monaural cue, its validity does not depend on a priori knowledge of stimu- 
lus characteristics. 
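
The dependence of this ratio on distance can be sketched under a common room-acoustics idealization that is not stated in the text: the direct sound follows the inverse square law while the reflected (reverberant) level is taken to be the same everywhere in the enclosure. The "critical distance" parameter, at which the two levels are equal, and the function names are assumptions introduced only for illustration.

```python
import math

def direct_to_reverberant_db(distance_m, critical_distance_m):
    """Direct-to-reverberant ratio in dB, assuming inverse-square direct
    sound and a distance-independent reverberant level."""
    return 20.0 * math.log10(critical_distance_m / distance_m)

def distance_from_ratio(ratio_db, critical_distance_m):
    """Invert the relation: distance implied by a given ratio in one room."""
    return critical_distance_m * 10.0 ** (-ratio_db / 20.0)

if __name__ == "__main__":
    d_c = 1.5  # assumed critical distance of the enclosure, in meters
    for d in (0.5, 1.5, 3.0, 6.0):
        print(f"{d:4.1f} m -> {direct_to_reverberant_db(d, d_c):+6.2f} dB")
```

In this idealization the ratio falls monotonically with distance and does not depend on the source level, consistent with the point that the cue requires no a priori knowledge of the stimulus, only of the enclosure.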


B. Acoustical Determinants of Apparent Spatial Position 

Our purpose in this section is to review what is currently known about how 
the acoustical information about the spatial position of stationary sources is 
actually used. Most of the experiments in this area have considered apparent 
source direction and apparent distance separately, and for convenience we 
maintain this separation here. Several comprehensive reviews of this area 
have appeared recently (Middlebrooks & Green, 1991; Wightman & Kistler, 
1993), so the material is only summarized here. 

In the vast majority of experiments on the apparent spatial position of 
stationary auditory objects, only apparent direction (azimuth and elevation) 
has been considered. Until recently, the dominant theoretical position, epit- 
omized by the duplex theory (Strutt, 1907), was that ITD provided the 
dominant source of information about apparent direction at low frequencies 
and that ILD was dominant at high frequencies. The duplex theory derived 
from the fact that the auditory system was much less sensitive to ITDs at 
high frequencies than at low frequencies (Joris & Yin, 1992; Yin & Chan, 
1988) and from the fact that ILDs are much larger at high frequencies than at 
low frequencies (see Figure 5). Information provided by pinna filtering was 
not considered in the duplex theory. 

Few empirical data on apparent source direction contradict the duplex 
theory. However, there are many natural circumstances that reveal the lim- 
itations of the theory and that argue for a situation-dependent weighting of 
the various sources of information about apparent sound direction. Local- 
ization of narrowband sounds is one such circumstance. Most narrowband 
sounds offer conflicting cues to apparent direction, so it is not surprising 
that they are not often localized accurately. The extreme case of a narrow- 
band sound is a sinusoid. Sinusoids offer doubly ambiguous ITD cues. A 
1000-Hz sinusoid, for example, could provide a 400-µs ITD leading to the 
right ear while at the same time indicating a 600-µs ITD leading to the left 
ear. As Figure 4 shows, each ITD signals a whole range of potential source 
directions. It should not be surprising that unless a sinusoid has a broadband 
transient associated with onset or offset, its apparent position is unclear 
(Hartmann, 1983). Other narrowband sounds are somewhat less ambiguous 
but still inaccurately localized. The apparent azimuth of a high-frequency 
noise band is given by ILD, as suggested by the duplex theory (Mid- 
dlebrooks, 1992). However, the apparent elevation seems to be determined 
by a learned association between spatial position and the spectral peaks and 
valleys produced by pinna filtering (Middlebrooks, 1992). The resultant 
apparent direction is often far removed from the actual source direction and 
well off the contour of directions indicated by ILD alone. In this case and 
others (e.g., monaural localization, as described by Butler, Humanski, & 
Musicant, 1990), the learned association between spatial position and pinna 
filtering details appears to be a favored source of information about apparent 
sound direction. In general, the data suggest that in the absence of unam- 
biguous (i.e., wideband) ITD, the information provided by pinna filtering 
appears to dominate. 
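
The ITD ambiguity of a sinusoid described above follows from the fact that delays differing by a whole number of periods produce the same interaural phase. The sketch below enumerates the candidate ITDs for a given tone. The 700-µs bound on physically plausible ITDs and the sign convention (positive means leading at the right ear) are assumptions made for the example.

```python
import math

def equivalent_itds_us(freq_hz, itd_us, max_itd_us=700.0):
    """All ITDs (in microseconds) within +/- max_itd_us that give the same
    interaural phase as itd_us for a pure tone of frequency freq_hz."""
    period_us = 1e6 / freq_hz
    # Shift by whole periods and keep the values inside the plausible range.
    k_min = math.ceil((-max_itd_us - itd_us) / period_us)
    k_max = math.floor((max_itd_us - itd_us) / period_us)
    return [itd_us + k * period_us for k in range(k_min, k_max + 1)]

if __name__ == "__main__":
    print(equivalent_itds_us(1000.0, 400.0))  # 1000-Hz tone: two candidates
    print(equivalent_itds_us(500.0, 400.0))   # 500-Hz tone: one candidate
```

For the 1000-Hz case the function returns both 400 µs (right ear leading) and -600 µs (left ear leading), matching the example in the text; at 500 Hz the period is long enough that only one candidate falls inside the assumed range.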

If a wideband source contains both low and high frequencies, apparent 
direction seems to be governed primarily by ITD (Wightman & Kistler, 
1992). In the Wightman and Kistler experiments, free-field noise sources 
were synthesized by using algorithms that were based on listeners' own 
HRTFs. The virtual sources were then presented by means of headphones, 
affording complete control over the acoustical stimulus. When the ITD 
information was manipulated to signal one direction and all other cues were 
left to signal another direction, the listeners’ judgments of apparent direc- 
tion always followed the ITD cue. Thus, even in the presence of opposing 
ILDs of as much as 20 dB, ITD was dominant. The dominance of ITD 
occurred for all listeners so long as the stimuli contained energy below 
about 1500 Hz. When the low frequencies were filtered out, ITD was effec- 
tively ignored and judgments of apparent position followed the ILDs and 
pinna filtering cues. 

The importance of the ITD cue is further emphasized by the fact that 
listeners make frequent front-back confusions in certain conditions (Old- 
field & Parker, 1984a, 1984b; Stevens & Newman, 1936; Wenzel, Arruda, 
Kistler, & Wightman, 1993; Wightman & Kistler, 1989b). Recall that if 
apparent direction were governed by ITD, front-back confusions would be 
expected given the spherical symmetry of the head (Figure 4). While the rate 
of front-back confusions in everyday life is unknown, with laboratory 
stimuli and especially virtual source stimuli, front-back confusion rates can 
be as great as 25% (Oldfield & Parker, 1984a, 1984b; Wightman & Kistler, 
1989b). Contours of constant ITD from actual measurements are smooth 
and regular, as predicted by the symmetry argument, though slightly dif- 
ferent for different listeners (Wightman & Kistler, 1993). Contours of con- 
stant ILD, on the other hand, are quite irregular and variable from one 
frequency band to another (Figure 6). We suggest that the fact that listeners 
make consistent and frequent front-back confusions argues at least indi- 
rectly for the dominance of ITD cues and the lesser importance of ILD and 
pinna filtering cues. 

The relative salience of the various acoustical cues to the spatial layout of 
auditory objects also depends on the “realism” of the cues. In experiments 
with virtual sources similar to those described above in which ITD was in 
conflict with other cues (Wightman & Kistler, 1992), we have produced 
stimuli in which cues in one frequency region conflict with cues in another 
frequency region. In one condition, for example, the ILD and spectral cues 
were the same throughout the frequency range (200 Hz-14000 Hz) and 
signaled, or “pointed to,” one of five possible directions on the horizontal 
plane. The ITD cue in each of four bands (roughly 1.5 octaves wide) pointed 
to a different direction. Thus, the ITD cue could be said to be “inconsistent” 
across the frequency range, and the other cues could be said to be “consis- 
tent.” In other conditions, the ITD cue was consistent and the other cues 
were inconsistent, and in still other conditions, the frequency range was 
divided somewhat differently. The results were unambiguous. Listeners’ 
judgments always followed the consistent cue. Even if the ITD cue was 
inconsistent in a single high-frequency band (above 5 kHz) listeners ap- 
peared to ignore ITD and put maximum weight on the ILD and spectral 
cues that were consistent across the spectrum. Not only does this result 
suggest that high-frequency ITD cues are encoded as well as low-frequency 
ITD cues, but it also suggests that cues that are realistic are given greater 
weight than unrealistic cues. With real sources and real listening environ- 
ments, it is highly unlikely that either the ITD or the other cues could be 
inconsistent across the frequency spectrum. 

The fidelity of the ITD, ILD, and spectral cues to spatial position is com- 
promised in most natural listening situations by the presence of echoes. 
These echoes, which to a first approximation are filtered copies of the sound 
wave, are produced when a sound wave bounces off objects or surfaces in 
the environment. Because of the extra distance they have to travel, they 
reach the listener slightly later than the original or direct sound wave. 
Typically, the intensities of the echoes are considerably weaker than the 
intensity of the direct sound, both because of the additional path length and 
because most objects and surfaces absorb some of the sound energy, partic- 
ularly at high frequencies. Nevertheless, when the echoes combine with the 
direct sound, the acoustical cues that signal the spatial position of the sound 
source are disrupted. With echoes the effective stimulus at each ear consists 
of the superposition of sounds from a number of different directions. Thus, 
both the monaural and binaural cues are distorted. 

It might be expected that the presence of echoes would seriously impair a 
listener’s ability to determine the spatial layout of sound sources. In fact, in 
all but the most extreme cases, the echoes are hardly noticed, and localiza- 
tion performance is not impaired (Begault, 1992; Hartmann, 1983). The 
substantial body of empirical data on this phenomenon can be summarized 
in the hypothesis that listeners attend only to the first few milliseconds of a 
stimulus, the time before echoes arrive, to determine the spatial position of a 
source. The spatial information arriving later, which would be corrupted by 
echoes, is somehow suppressed. This is the well-known precedence effect 
(Clifton & Freyman, 1989; Wallach, Newman, & Rosenzweig, 1949; Zurek, 
1980). Although many of the characteristics of the phenomenon and most 
of the underlying mechanisms are not well understood, it is clear that the 
precedence effect is of central importance to the determination of auditory 
spatial layout in natural listening situations. 

Compared with our well-developed understanding of how various 
sources of acoustical information are combined to determine the apparent 
direction of auditory objects, relatively little is known about how listeners 
might form a judgment of apparent distance. Available evidence suggests 
that perception of auditory distance is not well developed in humans. Ap- 
parent distance is typically very different from real distance (e.g., Gardner, 
1968; Mershon & King, 1975), and only relative distance can be determined 
with any accuracy (Cochran, Throop, & Simpson, 1968; Holt & Thurlow, 
1969). While there are suggestions in the literature that the distances of 
familiar sounds are judged more accurately (Coleman, 1962; McGregor, 
Horn, & Todd, 1985), the classic demonstration by Gardner (1968) shows 
that in an anechoic room with levels equalized, even the apparent distance of 
speech is not accurately reported. The most reliable finding seems to be that 
sounds presented with reverberation are judged to be more distant than the 
same sounds presented without reverberation (e.g., Mershon & King, 1975). 

From several different perspectives, inaccuracies in judging the distance of 
an auditory object are not surprising. First, the primary acoustical correlates 
of distance, level and spectrum, are unambiguous only if the characteristics 
of the source are known. Second, in everyday life the absolute distance of an 
auditory object carries little significance. Direction is clearly much more 
important; it serves to orient our gaze. Of course, if an auditory object is 
moving, and especially if that movement is toward the listener, distance 
carries considerable significance. Experiments on estimation of distance of a 
moving auditory object typically ask listeners to judge the time at which the 
object will reach the listener’s position; this is called time-to-contact. The 
available data on listeners’ judgments of auditory time-to-contact are re- 
viewed in a later section of this chapter. 

V. SPATIAL LAYOUT OF DYNAMIC AUDITORY OBJECTS 

In everyday life an individual’s auditory world is constantly in motion. The 
orientations of sound-producing objects with respect to a listener’s head and 
ears are ever changing, either because the objects themselves are moving or 
because the listener’s head is moving. In either case, the result is a constantly 
changing pattern of directional cues at the ears and, if conditions are right, 
the introduction of additional cues to movement such as the Doppler shift. 
This section of the chapter describes those additional movement cues in 
some detail, and we then discuss the available psychophysical data on lis- 
teners’ processing of dynamic spatial information. 



386 Frederic L. Wightman and Rick Jenison 

A. Additional Acoustical Information from Moving Sounds 

Moving sounds can be described by using the mathematics of kinematics 
(Jenison & Lutfi, 1992). Kinematics is the branch of mechanics that describes 
pure motion by using the variables of displacement, time, velocity, and 
acceleration. Doppler shifts and changes in ITD (described earlier) and inten- 
sity can be shown to have dependencies that are based on kinematics. In 
addition to ITD, Doppler shift, and time-varying intensity, the first differ- 
entials of these observed variables may directly be sensed as well. Figure 7 
shows the geometry of the sound source moving relative to an observer. $\varphi_t$ 
is the angle of the incident wavefront at any time $t$ and is dependent on the 
distance $D_t$ to a point $p$ on the median plane. $\theta_M$ is the angle at the anticipated 
closest point of approach (CPA), and $\beta$ is the angle of the source trajectory 
relative to the median plane. Angle $\beta$ is equivalent in magnitude to $\theta_M + \pi/2$. 
$R_t$ is the distance from the sound source to the observer. 

Movement of either the sound source or the observer changes the relative 
wavelength of the sound waves. This change is known as the Doppler shift. 
The well-known lawful dependence of the Doppler shift on velocity of the 
sound source relative to an observer is 

$$\omega = \frac{\omega_0}{1 - M \cos \varphi_t},$$ 

where $\omega_0$ is the intrinsic frequency, $\omega$ is the shifted frequency, $M$ is the Mach 
number defined as velocity divided by the speed of sound, and $\varphi_t$ is the 
angle of trajectory relative to the observer (see Figure 7). The frequency 
shift depends only on the velocity component directed toward the observer. 
This result holds true regardless of the time history of the trajectory. The 



FIGURE 7 Schematic diagram showing angular relations between a listener and a sound 
source that is moving along a straight path (represented by the arrow). 
Doppler-shifted frequency at a given time and position is affected only by 
the source’s velocity and frequency at the instant the wave is generated. 
Furthermore, the source need not be traveling at a constant velocity or in a 
straight line for it to apply. When the sound source is far from the observer 
and approaching ($\varphi_t$ is small, thus $\cos \varphi_t$ is near 1), the angle $\varphi_t$ changes 
very little, hence little change in the frequency shift. However, the magni- 
tude of the shift will be at its maximum. Since the sound source is approach- 
ing the observer, the shift is toward a higher frequency. As the sound source 
approaches the observer, $\varphi_t$ increases rapidly, resulting in a rapid decrease in 
frequency. As the sound source passes and recedes, there is a corresponding 
decrease in frequency relative to the intrinsic frequency of the sound source. 
This of course is the experience we have all had listening to a passing train 
whistle that decreases in pitch as it passes by and recedes into the distance. 
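
The passing-train example can be worked through numerically. The sketch below evaluates the Doppler relation given above, shifted frequency = intrinsic frequency / (1 - M cos φ), for a source moving along a straight line past a stationary observer. The coordinate layout (observer at the origin, source traveling in the +x direction along the line y = closest-approach distance), the 343 m/s speed of sound, and the function name are assumptions made for illustration; the ratio is indexed by the source position at the instant of emission, so propagation delay is ignored.

```python
import math

SPEED_OF_SOUND = 343.0  # assumed speed of sound in air, m/s

def passby_doppler_ratio(speed, x, cpa_distance, c=SPEED_OF_SOUND):
    """Ratio of shifted to intrinsic frequency for a source at (x, cpa_distance)
    moving in the +x direction at `speed`, heard by an observer at the origin."""
    r = math.hypot(x, cpa_distance)
    # cos(phi) is the fraction of the source velocity directed at the observer:
    # positive while approaching (x < 0), zero at closest approach, negative after.
    cos_phi = -x / r
    return 1.0 / (1.0 - (speed / c) * cos_phi)

if __name__ == "__main__":
    # A hypothetical train at 30 m/s passing 10 m from the listener.
    for x in (-500.0, -20.0, 0.0, 20.0, 500.0):
        print(f"x = {x:7.1f} m -> ratio {passby_doppler_ratio(30.0, x, 10.0):.4f}")
```

The ratio stays near its maximum while the source is far away and approaching, sweeps rapidly through unity near the closest point of approach, and settles below unity as the source recedes, which is the pitch drop described in the text.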

These observed variables, ITD, time-varying intensity, and Doppler, 
along with their first-order differentials with respect to time, all have char- 
acteristic spectrotemporal patterns. Zakarauskas and Cynader (1991) an- 
alyzed intensity patterns for actual moving sound sources along various 
trajectories and derived mathematical expressions for the observed variables 
that are related to the inverse-square distance relationship. Jenison (1994) 
extended these analyses to include Doppler and ITD patterns. The simplest 
trajectory is that of the rectilinear approach with constant velocity as shown 
in Figure 8. For illustration, the starting point for the moving sound source 
in these examples is located some distance $R_s$ directly on the median line as 
shown in Figure 8. 

The characteristic patterns for the three sound source trajectory angles 
($\beta$) of 90°, 120°, and 150° are shown in Figure 9. For the purpose of this 



FIGURE 8 Schematic diagram showing three example trajectories for a moving sound 
source. 
FIGURE 9 Results of kinematic analysis of the interaural time difference (ITD) (a), inten- 
sity (c), and Doppler shift (e) cues produced by a moving sound source. The rates of change of 
those cues are shown in (b), (d), and (f). 


example, we have assumed a source of moderate intensity, a velocity of 5 
m/s, and a starting distance from the observer of 5 m. Note that all of the 
ITD functions begin at 0 delay because of the midline starting point. The 
intensity functions will also start at the same intensity for a given distance 
from the observer. In the case of the Doppler shift, the shift is toward a 
higher frequency when the sound is approaching the observer and toward a 
lower frequency when receding. So for $\beta_1$ equal to 90°, the frequency shift 
will start at unity and decline. For the cases of $\beta_2$ and $\beta_3$, where the source is 
initially approaching, passes through a CPA, and then recedes, the frequency 
shift will initially be greater than unity and then decline. 
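
The qualitative behavior described for the three trajectories can be reproduced with a small kinematic sketch. It is not the analysis behind Figure 9; it assumes a particular geometry (observer at the origin facing +y, source starting on the median line at 5 m and moving at 5 m/s in a direction that makes the trajectory angle with the +y axis), the Woodworth spherical-head ITD approximation, inverse-square intensity, and a 343 m/s speed of sound. Propagation delay is ignored, and all names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0   # assumed, m/s
HEAD_RADIUS_M = 0.0875   # assumed spherical head radius

def kinematic_cues(beta_deg, t, speed=5.0, start_dist=5.0):
    """Return (ITD in s, intensity re start in dB, Doppler ratio) at time t."""
    beta = math.radians(beta_deg)
    vx, vy = speed * math.sin(beta), speed * math.cos(beta)
    x, y = vx * t, start_dist + vy * t           # source position at time t
    r = math.hypot(x, y)
    # Lateral angle from the median plane (front and back are folded together).
    lateral = math.asin(math.sin(math.atan2(x, y)))
    itd = (HEAD_RADIUS_M / SPEED_OF_SOUND) * (lateral + math.sin(lateral))
    intensity_db = -20.0 * math.log10(r / start_dist)   # inverse square law
    # Component of source velocity directed toward the observer, as a cosine.
    cos_phi = -(vx * x + vy * y) / (speed * r)
    doppler = 1.0 / (1.0 - (speed / SPEED_OF_SOUND) * cos_phi)
    return itd, intensity_db, doppler

if __name__ == "__main__":
    for beta in (90, 120, 150):
        for t in (0.0, 0.5, 1.0, 2.0):
            itd, level, dop = kinematic_cues(beta, t)
            print(f"beta={beta:3d} t={t:3.1f}s ITD={itd*1e6:7.1f}us "
                  f"level={level:+6.2f}dB doppler={dop:.5f}")
```

Under these assumptions every trajectory starts with zero ITD and the same intensity, the 90° trajectory starts with a Doppler ratio of unity, and the 120° and 150° trajectories start above unity and fall below it after the closest point of approach.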

Jenison (1994) has shown that acoustical kinematics sufficiently convey 
velocity (trajectory and speed) information regarding the moving sound 
source directly from the observed Doppler shift together with time-varying 
ITD. Although the theoretical analyses show that sufficient information is 
available to the observer regarding higher order variables such as the veloc- 
ity and time-to-contact of the moving sound source, it remains to be known 
whether the human observer has sufficient sensory mechanisms to detect 
this information, particularly under conditions of uncertainty. 

Most of the empirical research on perception of moving sound sources 
has focused, either directly or indirectly, on the question of whether dy- 
namic spatial changes are processed with some kind of specialized movement 
detectors. There is considerable neurophysiological evidence that differential 
information lawfully related to motion is directly detected by the visual 
system (Maunsell & Van Essen, 1983). Recent evidence suggests that there 
are also direction-sensitive neurons spatially segregated in auditory cortex 
(Stumpf, Toronchuk, & Cynader, 1992). Other findings suggest that neural 
processing of auditory motion involves mechanisms distinct from those 
involved in processing stationary sound location (Spitzer & Semple, 1991, 
1993; Stumpf, Toronchuk, & Cynader, 1992). Thus, while converging 
physiological evidence supports the existence of motion sensitive neurons, 
the psychophysical evidence for specialized motion detectors is inconclu- 
sive. The two lines of research that have addressed this question involve 
measurements of the minimum audible movement angle (MAMA) and mea- 
surements of auditory motion aftereffects. 

The MAMA experiments are variations of the classical minimum audible 
angle (MAA) experiments conducted with stationary sources. They are both 
detection or discrimination experiments that measure the threshold for dis- 
criminating small changes in spatial parameters. In the case of MAAs, what 
is measured is the smallest spatial separation of two static sources that can 
reliably be detected. The MAMA represents the smallest amount of spatial 
displacement or movement of a single source that can reliably be detected. 
Although both experiments can inform us about the processing capabilities 
of the auditory system, it is important to note that since they involve 
discrimination or detection paradigms, the extent to which the results can 
be generalized to questions about apparent spatial position may be quite 
limited. In other words, that listeners can discriminate between two sources 
at slightly different spatial positions does not necessarily imply that the 
apparent positions of the sources were different. Similarly, discrimination 
between a moving source and a static source does not necessarily imply that 
movement itself was perceived. 

While the investigators involved in the MAMA research may quibble 
over details, most would probably agree that the results do not support the 
existence of specialized motion detectors in the auditory system. Measured 
MAMAs, when expressed in terms of the total angle traversed at threshold, 
are roughly the same as or slightly larger than the MAAs measured with 
stationary sources, or about 2° (Grantham, 1986; Harris & Sergeant, 1971; 
Perrott & Musicant, 1977; Perrott & Tucker, 1988). A simple explanation of 
the basic MAMA results is that the listener takes an acoustic “snapshot” of 
the position of the source at the beginning and end of its trajectory 
(Grantham, 1986) and discriminates on the basis of static positional changes. 
Not all the available data support this view, but the exceptions are relatively 
minor (Perrott & Marlborough, 1989). 

Gibson (1966) took issue with the notion of a series of perceptual snap- 
shots, which requires fusion or composition to account for the perception of 
a single moving object. By redefining information for motion perception, 
Gibson eliminated the need for a concept such as fusion. Since motion 
information is available to the observer, even through discrete looks, the 
additional step of reconstruction to a continuous event is simply not neces- 
sary. To Gibson, the mechanics of the mediating sensory system were not 
germane to the perception of motion. To have “dynamic event perception,” 
in contrast to the less elegant “motion perception plus inference,” it must be 
shown that even though dynamic properties, such as mass and inertia, are 
not present in the optic (or acoustic) array, they are specified by the kinema- 
tics. That is, the information regarding the physical motion of an object is 
conveyed through the kinematics, whether discrete or continuous. 

Research on motion aftereffects provides indirect evidence on the ques- 
tion of the existence of specialized motion detectors. The idea is that expo- 
sure to an adapting stimulus that is moving in one direction fatigues the 
neural elements that respond to movement in that direction. The aftereffect, 
a perception of movement in the opposite direction, is presumed to reflect 
the spontaneous activity of the neural elements sensitive to movement in the 
opposite direction. Movement aftereffects are common in vision, one varia- 
tion of which is called the waterfall illusion (Sekuler & Pantle, 1967). 

Grantham (1989, 1992) has reported reliable though weak evidence for 
motion aftereffects in audition. After prolonged exposure to a free-field 
adapting stimulus that was moving in the horizontal plane, listeners’ judg- 
ments of the direction of movement of a subsequently presented probe 
stimulus were slightly biased in a direction opposite to that of the adapting 
stimulus. While the effects were disappointingly small, the results were 
nevertheless suggestive. 

Some of the research on perception of moving sound sources has been 
less concerned with the existence of specialized motion detectors and more 
broadly focused. For example, several studies have attempted to quantify 
the relative salience of the various sources of acoustical information that 
signal source movement. These experiments ask listeners to indicate the 
time at which a moving source is closest to them (time to interception) or 
the time at which they would make contact with the source (acoustic tau). In 
a theoretical study, Shaw, McGowan, and Turvey (1991) analyzed the 
acoustic intensity field produced by collinear relative movement between a 
sound source and an observer and showed the acoustic tau to be related to 
the inverse of the relative change in average intensity. Jenison (1994) ex- 
tended the analysis to the more general case, including time-to-interception, 
showing that time-averaged intensity and time-varying ITD and their cor- 
responding first-order derivatives are sufficient for conveying both collision 
and interception information. 
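
The relation between time-to-contact and the relative change in intensity can be checked in the simplest case. The sketch below assumes a source approaching head-on at constant speed with inverse-square intensity, for which time-to-contact equals 2I / (dI/dt). This closed form follows from those two assumptions and is offered as an illustration of the inverse relation described above, not as the derivation given by Shaw et al. (1991); the function names and parameter values are hypothetical.

```python
import math

def intensity(distance_m, source_power=1.0):
    """Inverse-square intensity in arbitrary units."""
    return source_power / distance_m ** 2

def tau_from_intensity(i, di_dt):
    """Time-to-contact from intensity and its rate of change:
    with I = k / r^2 and dr/dt = -v, dI/dt = 2 k v / r^3, so r / v = 2 I / (dI/dt)."""
    return 2.0 * i / di_dt

def simulated_tau(start_dist, speed, t, dt=1e-4):
    """Estimate tau at time t from intensity samples alone (central difference)."""
    r = lambda time: start_dist - speed * time
    di_dt = (intensity(r(t + dt)) - intensity(r(t - dt))) / (2.0 * dt)
    return tau_from_intensity(intensity(r(t)), di_dt)

if __name__ == "__main__":
    # Source starting 20 m away and closing at 5 m/s: contact occurs at t = 4 s.
    for t in (0.0, 1.0, 2.0, 3.0):
        print(f"t = {t:.1f} s -> estimated time-to-contact {simulated_tau(20.0, 5.0, t):.3f} s")
```

Note that the estimate uses only the intensity and its first derivative; neither the source power nor the distance needs to be known separately.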

Empirical studies of auditory time-to-contact or time-to-interception in- 
clude research reported by Rosenblum, Carello, and Pastore (1987) in 
which listeners heard sound sources over headphones. Three stimulus vari- 
ables were manipulated: interaural time difference, overall level, and Dopp- 
ler shift. Each was presented both in isolation and in competition so that 
each indicated a different point of closest approach or interception. The 
results suggested that while any of the three stimulus parameters could 
accurately indicate point of closest approach, overall level was the dominant 
cue. The authors argue that overall level should be dominant since it is the 
only cue of the three that is, in all environmental circumstances, unequivo- 
cal. Todd (1981) investigated how well subjects could discriminate time-to- 
contact for visual stimuli by simulating two simultaneously approaching 
objects on a computer display. Subjects were asked to judge which object 
would arrive first. We have recently launched analogous experiments that 
examine subjects’ ability to discriminate the arrival of two sound sources. 
Sounds were synthesized according to the simple kinematics of a moving 
sound composed of three harmonics by using ITD, average intensity, and 
Doppler shift. A sound arriving to the left of the listener was mixed with a 
sound arriving differentially in time to the right of the observer. Subjects 
were asked to choose which sound would arrive sooner. Figure 10 shows 
preliminary results from 24 subjects. In Todd’s experiment, relative time- 
to-contact was 75% correctly discriminated when the difference in time-to- 
contact was about 50 ms. In contrast, the relative auditory time-to-contact 
in our preliminary studies was 75% correctly discriminated when the differ- 
ence was about 300 ms. Schiff and Oldak (1990) examined observers’ accu- 
racy in using visual and acoustical estimates of time-to-arrival from film and 
sound-recorded approaching vehicles. Their data indicate that sighted sub- 
jects were significantly more accurate in estimating time-to-arrival with 
sight than with sound; however, visually impaired subjects performed as well as 
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Time-io-Comact 



FIGURE 10 Average psychometric function from 24 listeners in the time-to-contact ex- 
periment. Percentage correct discriminations between two sounds arriving at different times is 
plotted as a function of the arrival time difference. 


or better than the sighted subjects with only the acoustic channel. Although 
the evidence is only suggestive at this point, human observers have the 
capacity to efficiently estimate relative time-to-contact regardless of how 
the information is conveyed as long as the temporal window for estimation 
is within several seconds. This restricted window should not be surprising 
given the pattern of the observables described above. Significant changes in 
ITD, intensity, and Doppler occur only in a spatial region (hence the tempo- 
ral region as well) about the CPA. This relationship holds for subtended 
angle in the visual domain as well. 
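
The concentration of change around the CPA can be illustrated with a minimal sketch. It is not the synthesis procedure used in the experiments; it assumes a point source moving at constant speed along a straight line past a stationary listener in a free field, a spherical-head (Woodworth) approximation for ITD, inverse-square level loss, and no propagation delay. The head radius, speed of sound, source speed, and CPA distance are illustrative values, and the function names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, assumed value for air
HEAD_RADIUS_M = 0.0875   # assumed average head radius

def observables(t, speed, cpa_distance, f0=1000.0):
    """Return (ITD in s, level in dB re the CPA, received frequency in Hz).

    The listener sits at the origin facing the line of travel; t = 0 is the
    moment of closest approach. Propagation delay is ignored (assumption).
    """
    x = speed * t                                   # position along the path
    r = math.hypot(x, cpa_distance)                 # source-listener distance
    azimuth = math.atan2(x, cpa_distance)           # radians, positive = right
    itd = HEAD_RADIUS_M / SPEED_OF_SOUND * (azimuth + math.sin(azimuth))
    level_db = -20.0 * math.log10(r / cpa_distance) # inverse-square law
    radial_velocity = speed * x / r                 # positive when receding
    freq = f0 * SPEED_OF_SOUND / (SPEED_OF_SOUND + radial_velocity)
    return itd, level_db, freq

def change_over(t_start, t_end, speed=10.0, cpa_distance=2.0):
    """Absolute change in each observable between two instants."""
    a = observables(t_start, speed, cpa_distance)
    b = observables(t_end, speed, cpa_distance)
    return tuple(abs(q - p) for p, q in zip(a, b))

# Two windows of equal duration (0.2 s): one far from the CPA, one ending at it.
far_change = change_over(-5.1, -4.9)
near_change = change_over(-0.2, 0.0)
print("far from CPA:", far_change)
print("near CPA:    ", near_change)
```

Under these assumptions all three observables change far more in the window that ends at the CPA than in the equally long window five seconds earlier, which is consistent with the restricted temporal window described above.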

Head movements provide a somewhat different kind of dynamic audi- 
tory stimulus from movement of the sound source. Because head move- 
ments typically involve changes only in the direction of the sound source 
with respect to the head there is very little Doppler shift and very little 
change in overall level. However, interaural parameters change more rap- 
idly with head movements than with typical source movement. In addition, 
head movements provide additional information to the perceiver by means 
of proprioceptive feedback from the neck musculature. Although there has 
been speculation about the role of head movements for decades, there have 
been few empirical studies of their role (Pollack & Rose, 1967; Simpson &
Stanton, 1973; Thurlow & Runge, 1967). Only recently has empirical research
begun to provide firm evidence of the importance of head movements for
perception of the spatial layout of auditory objects.
Given a stationary auditory object in the environment there is a change in
the angular relation of the object and a listener's head that accompanies 
normal head movement. This change in relative orientation produces a 
systematic and predictable change in the pattern of spatial cues (ITD, ILD, 
and spectral cues) produced by the object at the listener’s ears. If these 
normal changes in the spatial cues are disrupted, the apparent position of the 
auditory object is often disturbed. Young (1931) reported one of the first 
demonstrations of this phenomenon. In this experiment, sounds were 
routed to the ears through rubber tubes attached to fixed ear trumpets. With 
this arrangement the normal coupling between a listener’s head movements 
and changes in the acoustical stimulus at the ears was eliminated. Listeners 
reported all sounds as originating behind the head, outside of the listeners’ 
visual fields, regardless of the actual position of the sound source. Similar 
front-back confusions are reported in the modern studies of virtual sound 
sources that are synthesized and presented to listeners by means of head- 
phones (Wightman & Kistler, 1989b). 

As mentioned above, front-back confusions are not entirely unexpected 
given the rough spherical symmetry of the head and the salience of ITD 
cues. The idea that in everyday life a listener's head movements might 
provide the information needed to avoid them is usually attributed to
Wallach (1940). Wallach showed that if a listener could monitor the direction of
change in ITD that accompanied a head movement, the front-back ambi- 
guity could be avoided. For example, suppose a sound is presented at an 
azimuth of 45° and an elevation of 0° (on the horizontal plane, roughly 45° to 
the right of the median plane). A front-back confusion would be repre- 
sented by an apparent azimuth report of roughly 135°. If the listener’s head 
moved to the right, the ITD produced by the source initially at 45° would 
decrease because the angle of the source relative to the head would approach 
0°, the point of minimum ITD. However, if the source were actually at 135° 
azimuth, the ITD would have increased. Thus, the direction of change in 
ITD unambiguously indicates whether the source was in the front or in the 
rear. 
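
Wallach's argument can be checked numerically with a small sketch. The ITD model here is an assumption, not something taken from Wallach: a rigid spherical head (Woodworth approximation) in which a rear source produces the same ITD as its front mirror image. The head radius, the speed of sound, and the function names are illustrative.

```python
import math

HEAD_RADIUS_M = 0.0875   # assumed average head radius
SPEED_OF_SOUND = 343.0   # m/s, assumed value for air

def itd_seconds(azimuth_deg):
    """ITD for a source at the given azimuth (0 = front, 90 = right, 180 = rear).

    The azimuth is folded onto the lateral angle, so front and rear mirror
    positions give the same ITD (the front-back ambiguity).
    """
    lateral = math.asin(math.sin(math.radians(azimuth_deg)))
    return HEAD_RADIUS_M / SPEED_OF_SOUND * (lateral + math.sin(lateral))

def itd_change_after_turn(source_az_deg, head_turn_deg):
    """Change in ITD when the head turns to the right by head_turn_deg."""
    before = itd_seconds(source_az_deg)
    after = itd_seconds(source_az_deg - head_turn_deg)
    return after - before

# Same starting ITD, opposite direction of change after a 10 degree right turn.
front_change = itd_change_after_turn(45.0, 10.0)
rear_change = itd_change_after_turn(135.0, 10.0)
print("start ITDs (s):", itd_seconds(45.0), itd_seconds(135.0))
print("front source change (s):", front_change)
print("rear source change (s): ", rear_change)
```

In this model the sources at 45° and 135° begin with the same ITD, but a rightward head turn decreases the ITD for the front source and increases it for the rear source, so the sign of the change separates the two.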

In spite of the simplicity and face validity of Wallach’s (1940) arguments, 
conclusive evidence that head movements are used to resolve front-back 
confusions has not appeared. One obvious reason for this is that experi- 
ments that control both head movements and the associated auditory stimu- 
lus dynamics have been technically too demanding until recently. Advanced 
technology now allows synthesis of virtual sources in such a way that the 
effects of head movements can directly be studied. Using magnetic head 
trackers and real-time convolution devices such as the Convolvotron
(Foster, Wenzel, & Taylor, 1991), one can monitor a listener’s head position
continually during an experiment and adjust the synthesis algorithms dy- 
namically (20-40 times per second) to simulate a stationary source. As the 
listener’s head moves, the device compensates for changes in the relative
positions of the stationary virtual source and the head by using different
left-right pairs of HRTF-based filters for each updated head position. The
movement compensation is smooth and the resultant percept of an external
sound source in a stationary position is compellingly realistic (Wenzel,
1992). 
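
The compensation step can be sketched in a few lines. This is a simplified illustration and not the Convolvotron's algorithm: it assumes head rotation about the vertical axis only, a hypothetical set of HRTF measurements spaced every 15° in azimuth, and nearest-neighbor selection rather than interpolation between filters. All names are hypothetical.

```python
def relative_azimuth(source_az_deg, head_yaw_deg):
    """Source azimuth relative to the head, wrapped to [-180, 180)."""
    return (source_az_deg - head_yaw_deg + 180.0) % 360.0 - 180.0

def nearest_measured_azimuth(rel_az_deg, measured_azimuths):
    """Pick the measured direction closest to the required relative azimuth."""
    def angular_distance(candidate):
        return abs((candidate - rel_az_deg + 180.0) % 360.0 - 180.0)
    return min(measured_azimuths, key=angular_distance)

# Hypothetical measurement grid: one left-right filter pair every 15 degrees.
MEASURED_AZIMUTHS = list(range(-180, 180, 15))

def filters_for_head_trace(source_az_deg, head_yaw_trace_deg):
    """Filter-pair azimuth to use at each head-tracker update."""
    return [
        nearest_measured_azimuth(
            relative_azimuth(source_az_deg, yaw), MEASURED_AZIMUTHS
        )
        for yaw in head_yaw_trace_deg
    ]

# A source fixed at 45 degrees while the head turns right in 10 degree steps.
selected = filters_for_head_trace(45.0, [0.0, 10.0, 20.0, 30.0])
print(selected)
```

As the head turns toward the source, the selected filter pair moves toward 0° azimuth, which is what keeps the virtual source fixed in the room rather than fixed to the head. A real system would also index elevation and would update at the tracker rate (20-40 times per second in the text above).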

We have recently begun some research on the role of head movements 
that takes advantage of the new technology and attempts to clarify some of
the issues raised by the earlier work (Wightman, Kistler, & Andersen,
1994). The essential elements of the paradigm were as described in earlier
work (Wightman & Kistler, 1989b). Listeners localized virtual sources (2.5 s
wideband noise bursts) in two conditions. In one, the virtual stimuli were 
presented over headphones with no head tracking, and the listeners were 
asked not to move their heads during the test. In the other, a magnetic head 
tracker was used to sense head position, and the virtual synthesis algorithms 
were modified in real time according to the head tracker’s reports. In the 
second condition, listeners were encouraged to move their heads during 
stimulus presentation if they felt it would facilitate localization. Apparent 
position judgments were made verbally after each stimulus presentation. 
Preliminary results from a single listener are shown in Figure 11. Note that 
in the head stationary condition this listener made frequent front-back 
confusions, as evidenced by the off-diagonal responses in the front-back 
panel. In the head-movement condition, however, the front-back confusions
were nearly eliminated. The listeners gave no indication of other
differences between the two conditions, either in their apparent position 
judgments or in their subjective reports. Thus, in contrast with suggestions 
in the literature, apparent source distance was the same with and without 
head movements (cf. Simpson & Stanton, 1973), and the images were 
equally well externalized in the two conditions (cf. Durlach et al., 1992). We 
conclude on the basis of these results that the primary role of head move- 
ments is resolution of confusions about the spatial layout of auditory ob- 
jects. 

VI. THE ROLE OF AUDITORY-VISUAL INTERACTIONS IN THE 
SPATIAL LAYOUT OF AUDITORY OBJECTS 

The sensory environment of most individuals includes both visual and audi- 
tory objects, and in many cases sound-producing objects can be seen as well 
as heard. Thus, while it is useful and informative to consider audition alone 
when discussing the spatial layout of auditory objects, it is important to be 
mindful of the potential role played by vision. Indeed, some auditory- 
visual interactions are quite powerful and their consequences well docu- 
mented. 

FIGURE 11 Apparent source position judgments from a single listener in an experiment 
in which the listener heard virtual sources presented over headphones. In one condition (left 
panels) the listener was required to hold his or her head still, and in the other condition (right 
panels) the listener was encouraged to move his or her head and the virtual stimuli were 
modified in real time according to the listener’s head position to simulate a stationary external 
source. Each judgment of apparent azimuth and elevation is represented in three panels that 
reflect the extent (expressed as an angle from -90° to +90°) to which the judged position is on 
the right or left (top), in the front or back (middle), and above or below the horizontal plane 
(bottom). The darkness of each symbol represents the number of judgments that fell in the 
local area of the symbol. 

The so-called ventriloquism effect is perhaps the best known of the
auditory-visual interactions (e.g., Pick, Warren, & Hay, 1969). The typical
manifestation of the effect is a strong biasing of the apparent position of an 
auditory object in the direction of a simultaneously present visual object. 
Evidence of the potency of this effect is familiar to anyone who has watched 
the image of someone speaking at the movies or on television. While the 
sound of the voice clearly seems to originate at the mouth of the person 
speaking, the actual source of the sound, a loudspeaker, is usually displaced 
far to one side. Clearly one’s perception of the spatial layout of auditory 
objects will be heavily influenced by whether or not the source of the sound 
is visible. 

Additional evidence for auditory-visual interactions comes from re- 
search on visual facilitation (e.g., Warren, 1970). Visual facilitation refers to 
the fact that the variance of localization judgments is lower when listeners 
hear the test stimulus in a lighted room than when they hear it in the dark. 
The source of sound is invisible in either case, and whether the listener 
makes the response in the light or the dark is irrelevant to the outcome. It is 
as if the listener is able to establish a frame of reference within which to 
place the auditory objects, and the presence of the frame of reference facili- 
tates localization. Some investigators argue that eye movements, even in the 
absence of visual input, are the basis of the facilitation effect (Jones & 
Kabanoff, 1975), but the issue is far from being resolved. What is especially 
interesting about the visual facilitation effect is that it occurs only in adults. 
Children as old as 12 years do not show the effect (Warren, 1970). 

VII. CONCLUSION 

The study of auditory object perception in general and the spatial layout of 
auditory objects in particular is in its infancy. In the case of the spatial layout 
of single stationary sound sources in anechoic space much is known about 
the sources of information and how that information is processed. The 
salience of ITD cues, the importance of monaural spectral cues derived from 
pinna filtering, the role of head movements, and so forth, have been thor- 
oughly documented in studies of single stationary sources. Relatively few 
investigators have ventured beyond the relative security of this constraint so 
that experiments involving nonanechoic listening conditions and moving 
sources are scarce, and studies of multiple sources are virtually nonexistent. 
The potential sources of information are reasonably well understood, but 
how that information might be used in the auditory system is completely 
unknown. 

The state of affairs in hearing contrasts sharply with the relative maturity 
of the study of visual spatial layout, in which research on such complex 
topics as optic flow has been in progress for decades. One reason for the 
slower progress on the hearing side may be that the experiments are
technically more demanding. For example, it is easier to present an arbitrary 
visual pattern to a retina than an arbitrary sound waveform to an eardrum. 
Technology is changing this situation rapidly, so we can expect significant 
advances in our understanding of auditory object perception in the near 
future. 